Bioterrorism has created a need for rapid analysis of samples that may contain toxins and other deadly agents. Mass spectrometry (MS) provides a proteomics tool for accurate and comprehensive profiling of proteins. A software tool that can search for matches against a proteome database is useful for forensic analysis of samples. One such tool, MARLOWE, was tested and worked well but failed to identify targets, such as the toxin abrin, that were missing from the KEGG.JP database (Kanehisa et al. 2002) on which it relies. FTP access to KEGG.JP is cost prohibitive. Here we create a database and a code modification that use the public UniProt.org proteome database (Consortium 2020). By creating a process to update this database, we can ensure that target organisms are identified correctly.
MARLOWE was installed on a 32-core Ubuntu Linux server with 500 GB of RAM, with MySQL 8.0.31 hosting the UniProt candidate database. The MARLOWE packages were modified to run correctly on Linux with R version 4.2.1 using the RStudio IDE.
MARLOWE was evaluated against both the KEGG and UniProt databases using 8 data files from biological samples, including fish, milk, oyster, juice and castor bean. The sample data had been processed by the PEAKS DeNovo assembler to determine the peptides contained in each sample. The organism identified in each MARLOWE run was compared with the actual contents of the sample to evaluate performance.
Figure 1: Flowchart showing in-silico digestion of protein amino acid sequence to determine peptides.
A minimal “candidate” database was built with UniProt proteomes for 9 organisms matching the samples. This involved downloading FASTA proteome files, parsing them with parse_fasta(), and then inserting the organism identification into the database along with the amino acid sequences of the proteins and of the peptides that result from digesting those proteins with trypsin. A final step is to upload NCBI taxonomy data for all organisms used to produce the MARLOWE heatmaps. Table 1 shows the organisms that have been inserted and the number of proteins and peptides for each. Strong peptides, which are present in multiple organisms in a genus, are determined for use in the scoring algorithm.
| name | taxon_id | protein_count | peptide_count |
|---|---|---|---|
| Bos taurus | 9913 | 23844 | 652649 |
| Citrus clementina | 85681 | 24934 | 586056 |
| Citrus sinensis | 2711 | 28128 | 572368 |
| Crassostrea gigas | 29159 | 25998 | 687216 |
| Crassostrea virginica | 6565 | 33719 | 876976 |
| Ricinus communis | 3988 | 31219 | 630447 |
| Pseudomonas fragi | 296 | 4324 | 85668 |
| Salvelinus namaycush | 8040 | 35973 | 696618 |
| Chlamydia pneumoniae | 83558 | 1052 | 23031 |
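
The parse-and-digest step described above can be illustrated with a minimal sketch. It is not the parse_fasta package itself: the helper names `read_fasta` and `trypsin_digest` and the toy sequence are hypothetical, and the sketch assumes the common trypsin rule (cleave after K or R unless the next residue is P) with no missed cleavages and no peptide length filter.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA text stream."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trypsin_digest(sequence):
    """Cleave after K or R unless followed by P (assumed rule, no missed cleavages)."""
    peptides, start = [], 0
    for i, residue in enumerate(sequence):
        at_end = i + 1 == len(sequence)
        if residue in "KR" and (at_end or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Keep any trailing fragment that does not end in K or R.
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


# Toy record (made up for illustration, not a real UniProt entry).
toy_fasta = io.StringIO(">toy_protein_1 made-up example\nMKTAYIAKQR\nPLSEERGK\n")
for header, seq in read_fasta(toy_fasta):
    print(header.split()[0], trypsin_digest(seq))
```

In a real build, each peptide would be inserted into the candidate database alongside its protein and organism identifiers; the table layout used by MARLOWE is not shown here.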
MARLOWE identified 7 of the 8 samples with both the KEGG and UniProt databases. Neither database was able to identify sample 555558-DeNovo, which may point to an issue with the sample. MARLOWE outputs a heatmap showing the most likely organisms contained in the sample.
Figure 2: Heatmap generated by MARLOWE with the KEGG database, showing that it correctly identified R. communis (castor bean) in the sample with a score of 303 strong peptides.
Figure 3: Heatmap generated by MARLOWE with the UniProt database, showing that it correctly identified R. communis (castor bean) in the sample with a score of 420 strong peptides.
Figure 4: Results from MARLOWE with 8 samples
We will need to improve the speed of the database building process using parallel computing and multiple servers. Building the sample database with 9 organisms took about 24 hours, whereas a fully functional database will require 10,000-22,000 organisms.
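
To make the scale of the problem concrete, the following back-of-the-envelope sketch extrapolates from the figures above. It assumes that build time grows linearly with the number of organisms and that parallel workers give an ideal speedup, neither of which has been measured; the function name `estimate_build_hours` and the choice of 32 workers (matching the server's core count) are illustrative only.

```python
def estimate_build_hours(n_organisms, hours_per_organism, workers=1):
    """Estimated wall-clock hours, assuming linear scaling and ideal speedup."""
    if workers < 1:
        raise ValueError("workers must be at least 1")
    return n_organisms * hours_per_organism / workers


# Observed in this project: about 24 hours for 9 organisms.
hours_per_organism = 24 / 9

for n in (10_000, 22_000):
    serial = estimate_build_hours(n, hours_per_organism)
    parallel = estimate_build_hours(n, hours_per_organism, workers=32)
    print(
        f"{n} organisms: {serial / 24 / 365:.1f} years serial, "
        f"{parallel / 24:.0f} days on 32 ideal workers"
    )
```

Under these assumptions a serial build would take roughly 3 to 7 years, and even 32 perfectly scaling workers would need on the order of 35 to 76 days, which is why multiple servers are likely to be needed.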
Converting the program to run in batch mode from the Linux shell would be more efficient and less error-prone than using RStudio. Parameters could then be specified on the command line.
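
As an illustration of what a batch interface might look like, here is a minimal sketch of command-line parameter handling. It is written in Python only to keep the examples in one language; MARLOWE itself is in R, where `Rscript` together with `commandArgs(trailingOnly = TRUE)` would play the same role. All option names, defaults and the file name are hypothetical and are not part of MARLOWE.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical batch run of one sample."""
    parser = argparse.ArgumentParser(
        description="Hypothetical batch driver for a single MARLOWE run"
    )
    # Option names below are assumptions made for illustration.
    parser.add_argument("--input", required=True,
                        help="de novo peptide file for one sample")
    parser.add_argument("--database", choices=["kegg", "uniprot"],
                        default="uniprot",
                        help="candidate database to search")
    parser.add_argument("--output-dir", default="results",
                        help="directory for heatmaps and score tables")
    return parser


# Parse an explicit argument list so the sketch runs without a real shell call.
args = build_parser().parse_args(["--input", "sample_peptides.csv"])
print(args.input, args.database, args.output_dir)
```

With an interface of this kind, all samples could be processed in a shell loop or by a job scheduler without opening an IDE.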
Currently MARLOWE only supports trypsin digestion. We can construct another version of the database in which the proteins have been digested with an alternative protease.
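
One way to support this is to make the cleavage rule a parameter of the digestion step. The sketch below is an assumption about how that could be done, not MARLOWE's implementation: the rule table `CLEAVAGE_RULES` and the function `digest` are hypothetical names, the rules are simplified textbook patterns, and missed cleavages are ignored.

```python
import re

# Simplified cleavage rules as zero-width regex patterns (assumed, not from MARLOWE).
# Each pattern matches the position immediately after the cleaved residue.
CLEAVAGE_RULES = {
    "trypsin": r"(?<=[KR])(?!P)",       # after K or R, not before P
    "lys-c": r"(?<=K)",                 # after K
    "chymotrypsin": r"(?<=[FWY])(?!P)",  # after F, W or Y, not before P
}


def digest(sequence, protease="trypsin"):
    """Split a protein sequence at every site matching the protease rule."""
    pattern = CLEAVAGE_RULES[protease]
    # Splitting on zero-width matches requires Python 3.7 or later.
    return [peptide for peptide in re.split(pattern, sequence) if peptide]


toy_sequence = "MKTAYIAKQRPLSEERGK"  # made-up sequence for illustration
for name in CLEAVAGE_RULES:
    print(name, digest(toy_sequence, name))
```

Because each protease yields a different peptide set, a separate set of peptide tables (or a protease column) would be needed for each enzyme supported.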
This project was a proof of concept to validate the parse_fasta package and the process for building a UniProt-sourced candidate database. It produced accurate and expected results with the test cases.